
Adding FSDP Support to Training Library #213

Merged · 47 commits · merged into main on Sep 26, 2024

Conversation

aldopareja (Member) commented Sep 18, 2024

Adds support for FSDP and FSDP with CPU offloading.

- Introduces `accelerate` as a distributed backend abstraction (for FSDP/DeepSpeed).
- Also fixes the Mistral template and cleans up data processing.

-Mustafa
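
For context, a minimal sketch of how an FSDP-backed Accelerator can be built through accelerate's `FullyShardedDataParallelPlugin`. This is not the PR's exact code; the helper name, dtype choices, and sharding strategy below are illustrative assumptions.

```python
# Hypothetical sketch, not the PR's implementation: building an Accelerator
# with an FSDP plugin, including optional CPU offloading of parameters.
import torch
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp import (
    BackwardPrefetch,
    CPUOffload,
    MixedPrecision,
    ShardingStrategy,
)


def make_fsdp_accelerator(cpu_offload_params: bool = False) -> Accelerator:
    """Illustrative helper; dtype and sharding choices are assumptions."""
    fsdp_plugin = FullyShardedDataParallelPlugin(
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
        mixed_precision_policy=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
            buffer_dtype=torch.bfloat16,
        ),
        cpu_offload=CPUOffload(offload_params=cpu_offload_params),
    )
    return Accelerator(fsdp_plugin=fsdp_plugin)
```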

@mergify mergify bot added the ci-failure label Sep 18, 2024
@mergify mergify bot added the ci-failure and dependencies labels and removed the ci-failure label Sep 18, 2024
@Maxusmusti Maxusmusti changed the title Ap/accelerate fsdp tmp2 Adding FSDP Support to Training Library Sep 24, 2024
@mergify mergify bot added the ci-failure and CI/CD labels and removed the ci-failure label Sep 24, 2024
This was referenced Sep 24, 2024
Resolved review threads:
- src/instructlab/training/config.py
- src/instructlab/training/main_ds.py (several threads, some outdated)
- src/instructlab/training/utils.py (several threads, some outdated)
mergify bot (Contributor) commented Sep 25, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @aldopareja please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the one-approval label Sep 25, 2024
Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
@mergify mergify bot added the documentation and ci-failure labels Sep 25, 2024
…ining_backend to TrainingArgs.distributed_backend and DistributedTrainingBackend to DistributedBackend

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
Signed-off-by: Mustafa Eyceoz <meyceoz@redhat.com>
@mergify mergify bot added ci-failure and removed ci-failure labels Sep 25, 2024
Signed-off-by: Mustafa Eyceoz <meyceoz@redhat.com>
@mergify mergify bot added ci-failure and removed ci-failure labels Sep 25, 2024
@@ -157,6 +181,12 @@ class TrainingArgs(BaseModel):
cpu_offload_optimizer_pin_memory=False,
)
)
fsdp_options: FSDPOptions = Field(
Contributor:

Does this need to be a factory? I think it can just be an assignment.

Contributor:

I'm following the current convention set by DeepSpeedOptions in the file, so IMO if we want to change this, we should make a follow-up PR that updates both of them.
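
To illustrate the two options being weighed, a hedged sketch follows; the `FSDPOptions` field shown is made up for the example, and the factory form mirrors the existing `DeepSpeedOptions` convention in this file.

```python
# Sketch only; the real FSDPOptions fields live in config.py and may differ.
from pydantic import BaseModel, Field


class FSDPOptions(BaseModel):
    cpu_offload_params: bool = False  # illustrative field


class TrainingArgs(BaseModel):
    # Factory style, matching the DeepSpeedOptions convention in this file:
    fsdp_options: FSDPOptions = Field(
        default_factory=lambda: FSDPOptions(cpu_offload_params=False)
    )
    # Plain-assignment alternative raised in the review; pydantic copies
    # field defaults per instance, so either form works here:
    # fsdp_options: FSDPOptions = FSDPOptions(cpu_offload_params=False)
```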

reduce_dtype=torch.bfloat16,
buffer_dtype=torch.bfloat16,
),
backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
Contributor:

Do we want to expose this ever? This adds a bit of memory overhead for some performance; I think customarily it's probably left at the default.

Contributor:

This is a good point. I think it's fine for now, but I will open an issue to track this, as I'm not sure how much of a performance hit compared to memory gain this option will be for us. It might be a nice bonus trick to avoid offloading in some configurations if the performance isn't horrendous.

Contributor:

Tracked in #228
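
As a sketch of what the follow-up tracked in #228 might look like, a config string could be mapped onto torch's `BackwardPrefetch` enum. The option names and helper below are hypothetical; the PR itself hard-codes `BACKWARD_PRE`.

```python
# Hypothetical helper for exposing backward prefetching as a config option.
from typing import Optional

from torch.distributed.fsdp import BackwardPrefetch


def resolve_backward_prefetch(name: Optional[str]) -> Optional[BackwardPrefetch]:
    """BACKWARD_PRE prefetches the next parameter shards before the current
    gradient computation (more overlap, more memory); BACKWARD_POST prefetches
    afterwards (less memory); None disables prefetching."""
    mapping = {
        "pre": BackwardPrefetch.BACKWARD_PRE,
        "post": BackwardPrefetch.BACKWARD_POST,
        None: None,
    }
    return mapping[name.lower() if name else None]
```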

}
return ds_config
def setup_optimizer(args, model):
if args.distributed_training_framework == "fsdp":
Contributor:

The typical way to do this is via this pattern:

Suggested change:
- if args.distributed_training_framework == "fsdp":
+ if DistributedBackend(args.distributed_training_framework) == DistributedBackend.FSDP:

This collects "magic strings" like "fsdp" into the Enum object.

Contributor:

Note: it actually has to be DistributedBackend.FSDP.value, since by this point the args have gone through the main_ds argparse post-torchrun and args.distributed_training_framework is just a string.

Contributor:

Fixed in latest commit.
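
For readers following along, a hedged sketch of the enum comparison being suggested. `DistributedBackend` and the `.value` subtlety come from the thread above; the `DEEPSPEED` member and the helper function are assumptions for illustration.

```python
# Sketch of the pattern discussed above; the real DistributedBackend enum
# lives in the training library's config module.
from enum import Enum


class DistributedBackend(Enum):
    FSDP = "fsdp"
    DEEPSPEED = "deepspeed"  # assumed member name


def uses_fsdp(distributed_training_framework: str) -> bool:
    # Post-torchrun, argparse hands us a plain string, so compare against the
    # member's .value (round-tripping through the enum also validates unknown
    # strings, at the cost of raising ValueError on typos).
    return distributed_training_framework == DistributedBackend.FSDP.value
```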

model.parameters(), lr=args.learning_rate, betas=(0.9, 0.95)
)
accelerator = setup_accelerator(args, model, grad_accum)
if args.distributed_training_framework == "fsdp":
Contributor:

Same enum trick here.

Contributor:

Note: it actually has to be DistributedBackend.FSDP.value, since by this point the args have gone through the main_ds argparse post-torchrun and args.distributed_training_framework is just a string.

Contributor:

Fixed in latest commit.

),
lr_scheduler=lr_scheduler,
dist_init_required=True,
model, optimizer, _, lr_scheduler = accelerator.prepare(
Contributor:

I see here that we're "double preparing" the model. Is that okay? Is Accelerate smart enough to handle this?

Contributor:

Yes, I have verified that it is. Originally I had some conditionals to avoid it, but accelerate was one step ahead.
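
For reference, the intended single-call shape of `prepare` looks roughly like the sketch below (variable names are illustrative); per the exchange above, accelerate also tolerates the model having already passed through an earlier prepare step.

```python
# Illustrative sketch: accelerator.prepare wraps the model for FSDP/DeepSpeed
# and returns shard- and device-aware versions of the other objects.
model, optimizer, train_loader, lr_scheduler = accelerator.prepare(
    model,
    optimizer,
    train_loader,
    lr_scheduler,
)
```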

global_grad_norm = accelerator.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
Contributor:

I haven't seen this here conventionally, only at the top of the training loop. I guess it can be either place. I also see that this is where they put it in the docs.

Contributor:

If it ain't broke 🤷🏻‍♂️

Contributor:

++
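
A minimal sketch of the step ordering in question, assuming an HF-style model that returns a `.loss` attribute; zeroing gradients at the end of the step is equivalent to zeroing at the top of the next iteration, as long as it happens between `optimizer.step()` and the next backward pass.

```python
# Sketch only; batch handling and loss computation are simplified.
for batch in train_loader:
    loss = model(**batch).loss
    accelerator.backward(loss)
    global_grad_norm = accelerator.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()  # end-of-step placement, as in this PR
```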

JamesKunstle (Contributor) left a comment:

IMO nothing that I noticed is blocking an approval. The only thing that I really want is for this PR to be rebased as a single commit so the history is a bit neater. Once that's done I'll approve!

Signed-off-by: Mustafa Eyceoz <meyceoz@redhat.com>
JamesKunstle (Contributor) left a comment:

lgtm!

@mergify mergify bot removed the one-approval label Sep 26, 2024
@Maxusmusti Maxusmusti merged commit 7b7fa12 into main Sep 26, 2024
14 checks passed
Labels: CI/CD, dependencies, documentation, hold
5 participants